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Abstract 

Nonparametric  Bayesian  methods  are  considered  for  recovery  of  imagery  based  upon  compressive, 
incomplete  and/or  noisy  measurements.  A  truncated  beta-Bernoulli  process  is  employed  to  infer  an 
appropriate  dictionary  for  the  data  under  test,  and  also  for  image  recovery.  In  the  context  of  compressive 
sensing,  significant  improvements  in  image  recovery  are  manifested  using  learned  dictionaries,  relative 
to  using  standard  orthonormal  image  expansions.  The  compressive-measurement  projections  are  also 
optimized  for  the  learned  dictionary.  Additionally,  we  consider  simpler  (incomplete)  measurements, 
defined  by  measuring  a  subset  of  image  pixels,  selected  uniformly  at  random;  connections  are  made 
to  matrix  completion  and  union-of-subspace  models,  providing  a  link  between  matrix  completion  and 
image  processing.  Spatial  inter-relationships  within  imagery  are  exploited  through  use  of  the  Dirichlet 
and  probit  stick-breaking  processes.  Several  example  results  are  presented,  with  comparisons  to  other 
methods  in  the  literature. 


I.  Introduction 

There  has  been  significant  recent  interest  in  sparse  image  representations,  in  the  context  of  denoising  and 
interpolation  [1],  [13],  [24]— [26],  [28],  [29],  [33],  compressive  sensing  (CS)  [5],  [12],  and  classification 
[40].  All  of  these  applications  exploit  the  fact  that  images  may  be  sparsely  represented  in  an  appropriate 
dictionary.  Most  of  the  denoising,  interpolation,  and  CS  literature  assumes  “off-the-shelf’  wavelet  and 
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DCT  bases/dictionaries  [20],  but  recent  research  has  demonstrated  the  significant  utility  of  learning  an 
often  over-complete  dictionary  matched  to  the  signals  of  interest  ( e.g .,  images)  [1],  [3],  [12],  [13],  [24]— 
[26],  [28],  [29],  [31],  [33],  [41], 

Many  of  the  existing  methods  for  learning  dictionaries  are  based  on  solving  an  optimization  problem 
[1],  [13],  [24]-[26],  [28],  [29],  in  which  one  seeks  to  match  the  dictionary  to  the  imagery  of  interest, 
while  simultaneously  encouraging  a  sparse  representation.  These  methods  have  demonstrated  state-of- 
the-art  performance  for  denoising,  super-resolution,  interpolation,  and  inpainting.  However,  many  existing 
algorithms  for  implementing  such  ideas  also  have  some  restrictions.  For  example,  one  must  often  assume 
access  to  the  noise/residual  variance,  the  size  of  the  dictionary  is  set  a  priori  or  fixed  via  cross-validation 
type  techniques,  and  a  single  (“point”)  estimate  is  learned. 

To  mitigate  the  aforementioned  limitations,  dictionary  learning  has  recently  been  cast  as  a  factor- 
analysis  problem,  with  the  factor  loadings  corresponding  to  the  dictionary  elements  (atoms).  Utilizing 
nonparametric  Bayesian  methods  like  the  beta  process  (BP)  [30],  [38],  [42]  and  the  Indian  buffet  process 
(IBP)  [18],  [21],  one  may  for  example  infer  the  number  of  factors  (dictionary  elements)  needed  to 
fit  the  data  itself.  Further,  one  may  place  a  prior  on  the  noise  or  residual  variance,  with  this  inferred 
from  the  data  [30],  [42],  An  approximation  to  the  full  posterior  may  be  manifested  via  Gibbs  sampling, 
yielding  an  ensemble  of  dictionary  representations.  Recent  research  has  demonstrated  that  an  ensemble 
of  representations  can  be  better  than  a  single  expansion  [14],  with  such  an  ensemble  naturally  manifested 
by  statistical  models  as  the  one  here  described.  Overall,  the  here  proposed  Bayesian  framework  provides 
a  complementary  and  alternative  framework  with  respect  to  the  more  standard  variational  formulations. 
These  can  also  be  interpreted  via  statistical  models  with  solutions  obtained  via  MAP  estimation,  e.g., 
see  [32]  for  an  overview  and  new  interpretation  of  this.  Such  probabilistic  interpretations  use  models 
different  than  the  ones  here  exploited,  and  as  mentioned  above,  have  to  estimate  critical  parameters  and 
produce  single  solutions. 

In  image  analysis  there  is  often  additional  information  that  may  be  exploited  when  learning  dictionaries, 
with  this  well  suited  for  Bayesian  priors.  For  example,  most  natural  images  may  be  segmented,  and 
it  is  probable  that  dictionary  usage  will  be  similar  for  regions  within  a  particular-  segment  class.  To 
address  this  idea,  we  extend  the  model  by  employing  a  probit  stick-breaking  process  (PSBP),  with 
this  a  generalization  of  the  Dirichlet  process  (DP)  stick-breaking  representation  [36].  Related  clustering 
techniques  have  proven  successful  in  image  processing  [27].  The  model  clusters  the  image  patches,  with 
each  cluster  corresponding  to  a  segment  type;  the  PSBP  encourages  proximate  and  similar-  patches  to  be 
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included  within  the  same  segment  type,  thereby  performing  image  segmentation  and  dictionary  learning 
simultaneously. 

As  discussed  when  presenting  results,  the  proposed  method  is  a  natural  tool  for  denoising  images, 
applicable  when  the  noise  statistics  arc  nonstationary.  The  nonstationary  noise  variance  is  inferred  within 
the  analysis.  The  principal  focus  of  this  paper,  however,  is  on  applying  the  algorithms  to  new  compressive 
measurement  techniques  that  have  been  developed  recently.  Specifically,  we  consider  dictionary  learning  in 
the  context  of  compressive  sensing  (CS)  [5],  [10],  in  which  the  measurements  correspond  to  projections 
of  typical  image  pixels.  We  consider  dictionary  learning  performed  “offline”  based  on  representative 
(training)  images,  with  the  learned  dictionary  applied  within  CS  image  recovery.  We  also  consider  the 
case  for  which  the  underlying  dictionary  is  learned  simultaneously  with  inversion  (reconstruction),  with 
this  related  to  “blind”  CS  [17].  Finally,  we  design  the  CS  projection  matrix  to  be  matched  to  the  learned 
dictionary  (when  this  is  done  offline),  and  demonstrate  as  in  [12]  that  in  practice  this  yields  performance 
gains  relative  to  conventional  random  CS  projection  matrices. 

While  CS  is  of  interest  for  its  potential  to  reduce  the  number  of  required  measurements,  it  has  the 
disadvantage  of  requiring  the  development  of  new  classes  of  cameras.  Such  cameras  are  revolutionary 
and  interesting  [11],  [37],  but  there  have  been  decades  of  previous  research  performed  on  development 
of  pixel-based  cameras,  and  it  would  be  desirable  if  such  cameras  could  be  modified  simply  to  perform 
compressive  measurements.  In  that  context,  we  note  that  there  has  been  recent  interest  in  the  field  of 
matrix  completion,  in  which  performance  guarantees  have  been  derived  that  are  similar  to  those  associated 
with  CS  [6].  One  may  view  the  pixel  values  of  an  image  as  a  matrix  of  data,  and  if  one  samples  the 
pixels  uniformly  at  random,  recovery  of  the  missing  pixels  corresponds  to  the  matrix-completion  problem. 
However,  the  matrix-completion  literature  is  based  on  the  assumption  that  the  matrix  of  interest  is  low 
rank  [6],  [22],  [35].  Because  the  underlying  dictionaries  associated  with  natural  images  arc  typically 
over-complete,  the  assumption  of  a  single  low-rank  matrix  of  pixel  values  is  often  inappropriate. 

While  a  direct  application  of  matrix-completion  technology  to  this  problem  is  then  inappropriate,  it 
may  be  modified  simply  such  that  it  is  useful.  Specifically,  because  natural  images  manifest  segments  and 
self-similarity,  one  may  view  the  dictionary-learning  framework  within  a  union-of-subspaces  setting  [15], 
[23].  Each  subspace,  defined  by  a  subset  of  the  dictionary,  represents  a  class  of  local  structure  within 
an  image,  and  each  subspace  and  associated  data  may  be  viewed  as  a  low-rank  matrix.  The  BP,  DP  and 
PSBP  models  discussed  above  are  employed  to  perform  joint  clustering  and  recovery  of  missing  pixels, 
with  comparisons  made  to  related  non-Bayesian  approaches. 

The  remainder  of  the  paper  is  organized  as  follows.  In  Section  II  we  review  the  classes  of  problems 
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being  considered.  The  beta-Bernoulli  process  is  discussed  in  Section  III,  with  relationships  made  with 
previous  work  in  this  area,  including  those  based  on  the  Indian  buffet  process.  The  Dirichlet  and  probit 
stick-breaking  processes  arc  discussed  in  Section  IV,  and  several  example  results  are  presented  in  Section 
V.  Conclusions  and  a  discussion  of  future  work  arc  provided  in  Section  VI,  and  details  of  the  inference 
equations  are  summarized  in  the  Appendix. 

II.  Problems  Under  Study 
A.  Denoising  and  compressive  sensing 

We  consider  data  samples  that  may  be  expressed  in  the  form 

x  i  =  D  Wi  +  e  i  (1) 

where  x,  G  Mp,  et  G  Mp,  and  w,  G  RA.  The  columns  of  the  matrix  D  G  WPxK  represent  the 
K  components  of  a  dictionary  with  which  x,  is  expanded.  For  our  problem,  the  x,  will  correspond 
to  B  x  B  (overlapped)  pixel  patches  in  an  image  [1],  [13],  [24],  [25],  [28],  [42],  The  set  of  vectors 
{aq}i=r,jv  may  be  extracted  from  an  image(s)  of  interest. 

For  the  denoising  problem,  the  vectors  c,  may  represent  sensor  noise,  in  addition  to  (ideally  small) 
residual  from  representation  of  the  underlying  signal  as  D  m,  .  To  perform  denoising,  we  place  restrictions 
on  the  vectors  Wi,  such  that  Dm,;  by  itself  does  not  exactly  represent  .x,.  A  popular  such  restriction  is 
that  Wi  should  be  sparse,  motivated  by  the  idea  that  any  particular  x,  may  often  be  represented  in  terms 
of  a  small  subset  of  representative  dictionary  elements,  from  the  full  dictionary  defined  by  the  columns  of 
D.  There  are  several  methods  that  have  been  developed  recently  to  impose  such  a  sparse  representation, 
including  (|  -based  relaxation  algorithms  [24],  [25],  iterative  algorithms  [1],  [13],  and  Bayesian  methods 
[42],  One  advantage  of  a  Bayesian  approach  is  that  the  noise/residual  statistics  may  be  nonstationary 
(with  unknown  noise  statistics).  Specifically,  in  addition  to  placing  a  sparscncss-pro rooting  prior  on  w., , 
we  may  also  impose  a  prior  on  the  components  of  et.  From  the  estimated  posterior  density  function 
on  model  parameters,  each  component  of  et,  corresponding  to  the  / th  B  x  B  image  patch,  has  its  own 
variance.  Given  our  goal  may  be  to  simultaneously  infer  D  and  {'fr>,},=  pJv  (and  implicitly 

e,),  and  then  the  denoised  version  of  x,  is  represented  as  Dtiq. 

In  many  applications  the  total  number  of  pixels  N  ■  P  may  be  large.  However,  it  is  well  known 
that  compression  algorithms  may  be  used  on  {x i}i=i,N  after  the  measurements  have  been  performed, 
to  significantly  reduce  the  quantity  of  data  that  need  be  stored  or  communicated.  This  compression 
indicates  that  while  the  data  dimensionality  N  ■  P  may  be  large,  the  underlying  information  content  may 
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be  relatively  low.  This  has  motivated  the  field  of  compressive  sensing  [5],  [10],  [11],  [37],  in  which  the 
total  number  of  measurements  performed  may  be  much  less  than  N  ■  P.  Toward  this  end,  researchers 
have  proposed  projection  measurements  of  the  form 

Vi  =  See,;  (2) 

where  £  £  MnxP  and  yi  £  Mn,  ideally  with  n  <C  P.  The  projection  matrix  £  has  traditionally  been 
constituted  randomly  [5],  [10],  with  a  binary  or  real  alphabet  (and  £  may  also  be  a  function  of  the 
specific  patch,  and  generalized  as  £,;).  It  is  desirable  that  matrices  £  and  D  be  as  incoherent  as  possible. 

The  recovery  of  x,  from  yi  is  an  ill-posed  problem  unless  restrictions  arc  placed  on  xt.  We  may 
exploit  the  same  class  of  restrictions  used  in  the  denoising  problem;  specifically,  the  observed  data 
satisfy  yt  =  <f >Wi  +  17,  with  $  =  £D  and  17  =  £ej,  and  with  sparse  uq.  Note  that  the  sparseness 
constraint  implies  that  {wi}i=\}N  (and  hence  {.'e,},=  n/v)  occupy  nonlinear  subspaces  of  Mp. 

In  most  applications  of  compressive  sensing  D  is  assumed  known,  corresponding  to  an  orthonormal 
basis  (e.g.,  wavelets  or  a  DCT)  [5],  [10],  [20].  However,  such  bases  are  not  necessarily  well  matched  to 
natural  imagery,  and  it  is  desirable  to  consider  design  of  dictionaries  D  for  this  purpose  [12].  One  may 
even  consider  recovering  {xi}i= from  {?/j}i=i,jv  while  simultaneously  inferring  D.  Thus,  we  again 
have  a  dictionary-learning  problem,  which  may  be  coupled  with  optimization  of  the  CS  matrix  £,  such 
that  it  is  matched  to  D  (defined  by  a  low  coherence  between  the  rows  of  £  and  columns  of  D  [5],  [10], 
[12],  [20]). 

B.  Matrix  completion  and  pixel  recovery 

Consider  a  matrix  M  £  W1'  xr'2  of  rank  r,  and  assume  we  only  observe  m  -C  n\  ■  112  components  of 
this  matrix,  with  the  observed  components  selected  uniformly  at  random.  Let  O  represent  a  set  defining 
the  entries  of  M  for  which  we  have  data.  Consider  the  convex  program 

minimize  j|Z||*  (3) 

subject  to  Pn( Z)  =  Pn( M)  (4) 

where  Pq(-)  defines  the  vector  of  samples  of  the  associated  matrix  that  are  in  the  set  Q.  The  nuclear 
norm  ||Z||*  =  JT  ^ ( Z ) ,  where  7,; ( Z )  represents  the  / th  singular  value  of  Z.  Defining  n  =  max(ni,n2), 
Candes  and  Tao  [6]  showed  that  with  probability  exceeding  1  —  n-3,  if  m  >  C p2nr(logn)6  then  Z  =  M, 
thereby  recovering  the  missing  entries  of  M  (  [8]  considers  related  issues).  The  parameter  //  represents 
the  coherence,  a  measure  of  how  “spread  out”  the  singular  vectors  of  M  are,  and  ideally  the  coherence 
is  near  one.  For  this  result  to  be  useful,  we  require  r  <C  ni  and  r  <7  m. 
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The  assumption  that  M  is  low  rank,  and  the  use  of  this  in  recovering  missing  matrix  entries,  implies 
that  the  columns  (and  rows)  of  M  reside  in  an  r-dimensional  linear  subspace.  Specifically, 

r 

M  =  ^2  ^iuivI  (5) 

i=  1 

where  A j  G  M,  u,  G  Mni ,  and  v,  G  Mn2  represent,  respectively,  the  ith  singular  value,  left  singular  vector, 
and  right  singular  vector.  Hence,  each  column  of  M  is  assumed  to  reside  in  a  linear  subspace  defined 
by  {ui}i=hr. 

We  now  reconsider  the  set  of  vectors  discussed  above,  again  with  re*  G  Mp  corresponding 

to  patches  of  pixels  from  the  image(s).  Let  X  G  RPxN  have  columns  defined  by  the  vectors 
It  is  assumed  that  a  randomly  constituted  subset  of  the  components  of  X  arc  measured,  and  our  goal  is  to 
recover  the  missing  data  using  ideas  analogous  to  those  employed  in  matrix-completion  theory.  However, 
as  discussed  above,  each  xt  =  Dm*  +  er,  for  sparse  m,.  If  we  ignore  e,  for  now,  the  columns  of  X 
reside  in  a  nonlinear  subspace  of  Mp,  due  to  the  fact  that  all  wt  are  sparse  (not  necessarily  with  the 
same  sparsity  patterns).  Further,  since  the  dictionary  D  is  typically  over-complete,  X  is  generally  not 
low-rank.  Therefore,  linear  matrix-completion  theory  is  not  applicable  to  X. 

While  a  direct  application  of  this  theory  may  be  inappropriate  for  image-processing  problems,  the 
framework  may  be  modified  to  make  it  applicable.  As  one  way  in  which  the  model  may  be  modified, 
recall  that  images  tend  to  possess  significant  self-similarity  [4]  and  segments,  implying  that  many  B  x  B 
patches  have  similar  structure.  This  suggests  that  there  may  be  a  clustering  of  the  B  x  B  blocks,  and  that 
within  each  cluster  the  associated  data  can  constitute  the  columns  of  a  matrix  of  low-rank,  recoverable  in 
the  presence  of  significant  missing  matrix  values.  The  nonparametric  Bayesian  methods  developed  here 
implement  joint  clustering  of  the  observed  image  patches  and  inference  of  the  missing  pixel  values. 

III.  Sparse  Factor  Analysis  with  the  Beta-Bernoulli  Process 

When  presenting  example  results,  we  will  consider  three  problems.  For  demising,  it  is  assumed  we 
measure  Xi  =  Dm,  +  e,;,  where  e,  represents  measurement  noise  and  model  error;  for  the  compressive- 
sensing  application  we  observe  yi  =  S(DiOj  +  e,.)  =  <S>w,  +  Uj,  with  $  =  SD  and  u,  =  <Fe,;  and, 
finally,  for  the  interpolation  problem  we  observe  P?(Dw,  +  e,. ) ,  where  P0(xt)  is  a  vector  of  elements 
from  x,  contained  within  the  set  <f>,  as  in  (4).  For  all  three  problems  our  objective  is  to  infer  the  underlying 
signal  Dm*,  with  w,  assumed  sparse;  we  generally  wish  to  simultaneously  infer  D  and  {m,},-!^.  To 
address  each  of  these  problems,  we  consider  a  statistical  model  for  Xi  =  Dm,  +  e,,  placing  Bayesian 
priors  on  D,  m*  and  e, ;  the  way  the  model  is  used  is  modified  slightly  for  each  specific  application.  For 
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example,  when  considering  interpolation,  only  the  observed  P^xf)  arc  used  within  the  model  likelihood 
function. 

A.  Beta-Bemoulli  process  for  active-set  selection 

Let  the  binary  vector  z*  e  {0, 1 } K  denote  which  of  the  K  columns  of  D  arc  used  for  representation 
of  Xi  (active  set);  if  a  particular  component  of  z%  is  equal  to  one,  then  the  corresponding  column  of  D 
is  used  in  the  representation  of  ®j.  Hence,  for  the  data  {.x,;}?=  |  there  is  an  associated  set  of  latent 
binary  vectors  {zjjj^qjv,  and  the  beta-Bernoulli  process  provides  a  convenient  prior  for  these  vectors 
[30],  [38],  [42],  Specifically,  consider  the  model 

K 

n  Bcrnoulli(7T/;.) 

k= l 
K 

7r  ~  Y[Beta(a/K,b(K  -  l)/K)  (6) 

k= l 

where  7 is  the  kth  component  of  7r,  and  a  and  6  arc  model  parameters;  the  impact  of  these  parameters 
on  the  model  arc  discussed  below. 

Considering  the  limit  K  — >  oo,  and  after  integrating  out  7r,  the  draws  of  {zi}i=i,jv  may  be  constituted  as 
follows.  For  each  zt,  draw  c,  ~  Poisson(6+?_1)  and  define  C,  =  Yl}= i  cv  with  Co  =  0.  Let  z,j.  represent 
the  kth  component  of  z*,  and  Zik  =  0  for  k  >  Ci.  For  k  =  1, . . . ,  Ci- i,  Zik  ~  Bernoulli where 
nik  =  E}=1  zjk  ( tiik  represents  the  total  number  of  times  the  kth  component  of  { z;  }?  =  ]  ,_  ]  is  one).  For 
k  =  Ci- 1  +  1, . . . ,  C{,  we  set  z^  =  1.  Note  that  as  a/(b  +  %  —  1)  becomes  small,  with  increasing  i,  it  is 
probable  that  a  will  be  small.  Hence,  with  increasing  i,  the  number  of  new  non-zero  components  of  z,; 
diminishes.  Further,  as  a  consequence  of  Bernoulli when  a  particular  component  of  the  vectors 
is  frequently  one,  it  is  more  probable  that  it  will  be  one  for  subsequent  Zj,  j  >  i.  When 
6  =  1  this  construction  for  {zi}i- i;jv  corresponds  to  the  Indian  buffet  process  [18]. 

Since  z,;  defines  which  columns  of  D  are  used  to  represent  xt,  (6)  imposes  that  it  is  probable  that 
some  columns  of  D  are  used  repeatedly  among  the  set  {®j}i= i  while  other  columns  of  D  may  be 
more  specialized  to  particular'  aq.  As  demonsttated  below,  this  has  been  found  to  be  a  good  model  when 
i  v  ai'e  patches  of  pixels  extracted  from  natural  images. 
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B.  Full  hierarchical  model 


The  hierarchical  form  of  the  model  may  now  be  expressed  as 

Xi  =  D  Wi  +  e  i 

Wi  =  Zi  ©  Si 

dk  ~  AA(0,  P~1Ip) 

Si  ~ 

€i  ~  AA(0,7e_1Ip)  (7) 


where  dk  represents  the  kth  component  (atom)  of  D,  o  represents  the  pointwise  or  Hadamard  vector 
product.  Ip  (I k)  represents  a  P  x  P  (K  x  I\)  identity  matrix,  and  {z,},=  p/y  are  drawn  as  in  (6). 
Conjugate  hypeipriors  js  ~  Ganmia(c,  d)  and  ~  Gamma(e,  /)  are  also  imposed.  The  construction  in 
(7),  and  with  the  prior  in  (6)  for  {zi}i=itN,  is  henceforth  referred  to  as  the  beta  process  factor  analysis 
(BPFA)  model. 

Note  that  we  impose  independent  Gaussian  priors  for  dk.  Si  and  e,  for  modeling  convenience  (conju- 
gacy  of  consecutive  terms  in  the  hierarchical  model).  However,  the  inferred  posterior  for  these  terms  is 
generally  not  independent  or  Gaussian.  The  independent  priors  essentially  impose  prior  information  about 
the  marginals  of  the  posterior  of  each  component,  while  the  inferred  posterior  accounts  for  statistical 
dependence  as  reflected  in  the  data.  To  make  connections  of  this  model  to  more-typical  optimization-based 


approaches  [24],  [25],  note  that  the  negative  logarithm  of  the  posterior  density  function  is 


-log  p(@\V,H)  =  ^^\\xi-V(siozi)\\l  + ~^\\dk\\l  +  ^Y^ 


-  log  fBeta-Bem({zi}^=1-,Tl)  -  log  Ganmia(7e \H)  -  log  Gamma(7s|77)  +  Const. 


where  0  represents  all  unknown  model  parameters,  V  =  {*,:},:=i,at,  fBeta-Bern{{zi}iL pTf)  represents 
the  beta-Bernoulli  process  prior  in  (6),  and  H  represents  model  hyper-parameters  (i.e.,  a,  b,  c,  d,  e  and 
/).  Therefore,  the  typical  1 2  constraints  [24],  [25]  on  the  dictionary  elements  dk  and  on  the  non-zero 
weights  si  correspond  here  to  the  Gaussian  priors  employed  in  (7).  However,  rather  than  an  employing 
an  i\  (Laplacian  prior)  constraint  [24],  [25]  to  impose  sparseness  on  wt.  we  employ  the  beta-Bernoulli 
process  and  Wi  =  Sj  OZj.  The  beta-Bernoulli  process  imposes  that  the  binary  zt  should  be  sparse,  and  that 
there  should  be  a  relatively  consistent  (re)use  of  dictionary  elements  across  the  image,  thereby  imposing 
self-similarity.  Further,  and  perhaps  most  importantly,  we  do  not  constitute  a  point  estimate,  as  one  would 
do  if  a  single  ©  was  sought  to  minimize  (8).  We  rather  estimate  the  full  posterior  density  p(@\'D,H), 
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implemented  via  Gibbs  sampling.  A  significant  advantage  of  the  hierarchical  construction  in  (7)  is  that 
each  Gibbs  update  equation  is  analytic,  with  detailed  update  equations  provided  in  Appendix  B.  Note  that 
consistent  use  of  atoms  is  encouraged  because  the  active  sets  arc  defined  by  the  binary  vectors  {z*}i= qjv, 
and  these  are  all  drawn  from  a  shared  probability  vector  7r;  this  is  distinct  from  drawing  the  active  sets 
i.i.d.  from  a  Laplacian  prior.  Further,  the  beta-Bernoulli  prior  imposes  that  many  components  of  are 
exactly  zero,  while  with  a  Laplacian  prior  many  components  arc  small  but  not  exactly  zero. 

IV.  Patch  Clustering  via  Dirichlet  and  Probit  Stick-Breaking  Processes 
A.  Dirichlet  process 

As  discussed  in  Section  II-B,  in  many  applications  it  is  expected  that  the  data  patches  {ccj}i=i,Ar  may 
cluster,  and  it  is  of  interest  to  infer  this  clustering  nonparametrically  (i.e.,  to  infer  the  number  of  clusters 
and  their  composition  from  the  data).  The  imposition  of  such  prior  knowledge  may  improve  the  quality 
of  the  inversion  for  {aij}i=i,Ar  based  upon  incomplete  measurements.  The  Dirichlet  process  (DP)  [16] 
constitutes  a  popular  means  of  performing  such  nonparametric  clustering.  A  random  draw  from  a  DP, 
G  ~  DP(aG'o),  with  precision  a  £  M+  and  “base”  measure  Go,  may  be  constituted  via  the  stick-breaking 
construction  [36] 

OO 

G  =  J2ft6o;  >  ei~Go  (9) 

i=i 

where  (3t  =  Vx  nl=\(l  -  Vh)  and  Vh  Beta(l,  a).  The  (3i  may  be  viewed  as  a  sequence  of  fractional 
breaks  from  a  “stick”  of  original  length  one,  where  the  fraction  of  stick  broken  off  on  break  I  is  l  /.  The 
(ij  are  model  parameters,  associated  with  the  / tli  data  cluster.  For  our  problem  it  has  proven  effective  to 
set  Go  =  nf=i  Beta (a/K,  b(K  —  1  )/K)  analogous  to  (6),  and  hence  G  =  Yli^i  The  drawn 

from  Go,  correspond  to  distinct  probability  vectors  for  using  the  K  dictionary  elements  (columns  of  D). 
For  sample  i  we  draw  7r(  ~  G,  and  a  separate  sparse  binary  vector  z,  is  drawn  for  each  sample  xt,  as 
Zi  ~  nti  BemouUi(7Tjfc),  with  tt,/.  the  kth  component  of  7 rq.  In  practice  we  truncate  the  infinite  sum  for 
G  to  Nl  elements,  and  impose  Vnl  =  1,  such  that  J2i=i  A  =  1.  A  (conjugate)  gamma  prior  is  placed 
on  the  DP  parameter  a. 

We  may  view  this  DP  construction  as  an  “Indian  buffet  franchise,”  generalizing  the  Indian  buffet 
analogy  [18].  Specifically,  there  are  Nl  Indian  buffet  restaurants;  each  restaurant  is  composed  of  the 
same  “menu”  (columns  of  D),  and  is  distinguished  by  different  probabilities  for  selecting  menu  items. 
The  “customers”  {ccj}i=i,rv  cluster  based  upon  which  restaurant  they  go  to.  The  {nj  }/=i,a-;,  represent 
the  probability  of  using  each  column  of  D  in  the  respective  Nl  different  buffets.  The  {ccj}i=i,jv  cluster 


April  17,  2010 


DRAFT 


10 


themselves  among  the  different  restaurants  in  a  manner  that  is  consistent  with  the  characteristics  of  the 
data,  with  the  model  also  simultaneously  learning  the  dictionary/menu  D.  Note  that  we  typically  make 
the  truncation  Nl  large,  and  the  posterior  distribution  infers  the  number  of  clusters  actually  needed  to 
support  the  data,  as  represented  by  how  many  /?/  are  of  significant  value.  The  model  in  (7),  with  the 
above  DP  construction  for  {zi}i=i,Ar,  is  henceforth  referred  to  as  DP-BPFA. 

B.  Probit  stick-breaking  process 

The  DP  yields  a  clustering  of  {xi}i=\tN,  but  it  does  not  account  for  our  knowledge  of  the  location 
of  each  patch  within  the  image.  It  is  natural  to  expect  that  if  x%  and  xt>  are  proximate  then  they  are 
likely  to  be  constituted  in  terms  of  similar  columns  of  D  1 .  To  impose  this  information,  we  employ  the 
probit  stick-breaking  process  (PSBP).  A  logistic  stick-breaking  process  is  discussed  in  detail  in  [34],  We 
employ  the  closely  related  probit  version  here  because  it  may  be  easily  implemented  in  a  Gibbs  sampler. 
We  note  that  while  the  method  in  [34]  is  related  to  that  discussed  below,  in  [34]  the  concepts  of  learned 
dictionaries  and  beta-Bernoulli  priors  were  not  considered. 

We  augment  the  data  as  {ccj,  ri}i=itN,  where  x,  again  represents  pixel  values  from  the  7 1 h  image  patch, 
and  Vi  £  R2  represents  the  two-dimensional  location  of  each  patch.  We  wish  to  impose  that  proximate 
patches  are  more  likely  to  be  composed  of  the  same  or  similar  columns  of  D.  In  the  PSBP  construction, 
all  aspects  of  (7)  are  retained,  except  for  the  manner  in  which  z*  are  constituted.  Rather  than  drawing  a 

single  A'-dimcnsional  vector  of  probabilities  n  as  in  (6),  we  draw  a  library  of  such  vectors: 

I< 

7r*  ~  JJ  Beta (a/K,  b(K  -  1  )/K)  ,  1  =  1,...,  NL  (10) 

k= 1 

and  each  it]  is  associated  with  a  particular  segment  in  the  image.  One  7 r*  is  associated  with  location  t*, 
and  drawn 

Nl 

i  ~  ^  (11) 

1=1 

with  J2k=i  Pl(ri )  =  1  f°r  aH  ru  and  ()n-  represents  a  point  measure  concentrated  at  7tJ\  Once  7r,  is 
associated  with  a  particular  Xi,  the  corresponding  binary  vector  z,  is  drawn  as  in  the  first  line  of  (6). 
Note  that  the  distinction  between  DP  and  PSBP  is  that  in  the  former  the  mixture  weights  {/3i}i=\  nl  are 
independent  of  spatial  position  r,  while  the  latter  explicitly  utilizes  r  within  {A(r)}/=i,JVi,  (and  below 
we  impose  that  (3i(r)  changes  smoothly  with  r). 

'Proximity  can  be  modeled  as  in  “spatial  proximity,”  as  here  developed  in  detail,  or  “feature  proximity”  as  in  non-local  means 
and  related  approaches,  see  [27]  and  references  therein. 
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The  space-dependent  weights  are  constructed  as  fii(r)  =  Vi(r)  nUjl  —  V/,(r)]  where  0  <  V\(r)  <  1 
constitute  space-dependent  probabilities.  We  set  Vnl  =  1,  and  for  l  <  Nl  —  1  the  V/  are  space-dependent 
probit  functions: 

f9i{r) 

Vi(r)=  dxAf(x\0,l),  gi{r)  =  Q0  +  ^2cHIC(r,ri-,^i)  (12) 

J-OO  i=1 

where  JC(r,rf,ij)i)  is  a  kernel  characterized  by  parameter  ipi  and  {Oi}i=o,iv  are  a  sparse  set  of  real 
numbers.  To  implement  the  sparseness  on  {Cu}i=o,N,  within  the  prior  ~  A/"(0,  aT- 1 ),  and  (conjugate) 
an  ~  Gamma(ao,  bo),  with  (ao,  bo)  set  to  favor  most  an  being  large  (if  an  is  large,  a  draw  J\f(0,  af1)  is 
likely  to  be  near  zero,  such  that  most  {Oi}i=o,iv  are  near  zero).  This  sparseness-promoting  construction 
is  the  same  as  that  employed  in  the  relevance  vector  machine  (RVM)  [39] .  We  here  utilize  a  radial  basis 
function  (RBF)  kernel  /C(r,  ipi)  =  exp[—  ||r,  —  r^/'ipi}. 

Each  gi(r)  is  encouraged  to  only  be  defined  by  a  small  set  of  localized  kernel  functions,  and  via  the 
probit  link  function  dxj\f(x\0,  1)  the  probability  Vfr)  is  characterized  by  localized  segments  over 
which  the  probability  Vj(r)  is  contiguous  and  smoothly  varying.  The  Vj(r)  constitute  a  space-dependent 
stick-breaking  process.  Since  Vnl  =  1,  ^/V=i  fti(r)  =  1  f°r  all  r- 

The  PSBP  model  is  relatively  simple  to  implement  within  a  Gibbs  sampler.  For  example,  as  indicated 
above,  sparseness  on  is  imposed  as  in  the  RVM,  and  the  probit  link  function  is  simply  implemented 
within  a  Gibbs  sampler  (which  is  why  it  was  selected,  rather  than  a  logistic  link  function).  Finally,  we 
define  a  finite  set  of  possible  kernel  parameters  {ipj}j=i,Np,  and  a  multinomial  prior  is  placed  on  these 
parameters,  with  the  multinomial  probability  vector  drawn  from  a  Dirichlet  distribution  [34]  (each  of  the 
gi(r)  draws  a  kernel  parameter  from  {^j}j=i,Np)-  The  model  in  (7),  with  the  PSBP  construction  for 
{zi}i= itN,  is  henceforth  referred  to  as  PSBP-BPFA. 

C.  Discussion  of  proposed  sparseness-imposing  priors 

The  basic  BPFA  model  is  summarized  in  (7),  and  three  related  priors  have  been  developed  for  the 
sparse  binary  vectors  {zt}l=  |  ;y:  (i)  the  basic  truncated  beta-Bernoulli  process  in  (6),  (ii)  a  DP-based 
clustering  of  the  underlying  {7^}*=!^,  and  (in)  a  PSBP  clustering  of  {7r,}i=iv  that  exploits  knowledge 
of  the  location  of  the  image  patches.  For  (ii)  and  (in),  the  x,  within  a  particular  cluster  have  similar 
Zi,  rather  than  exactly  the  same  binary  vector;  we  also  considered  the  latter,  but  this  worked  less  well  in 
practice.  As  discussed  further  when  presenting  results,  for  denoising  and  interpolation,  all  three  methods 
yield  comparable  performance.  However,  for  CS,  (ii)  and  (in)  yield  marked  improvements  in  image- 
recovery  accuracy  relative  to  (i).  In  anticipation  of  these  results,  we  provide  a  further  discussion  of  the 
three  priors  on  {zi}i= 1  jy  and  on  the  three  image-processing  problems  under  consideration. 
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For  the  denoising  and  interpolation  problems,  we  arc  provided  with  the  data  {aqj^qjv,  albeit  in  the 
presence  of  noise  and  potentially  with  substantial  missing  pixels.  However,  for  this  problem  N  may 
be  made  quite  large,  since  we  may  consider  all  possible  (overlapping)  B  x  B  patches.  A  given  pixel 
(apart  from  near  the  edges  of  the  image)  is  present  in  B2  different  patches.  Perhaps  because  we  have 
such  a  large  quantity  of  partially  overlapping  data,  for  denoising  and  interpolation  we  have  found  that 
beta-Bernoulli  process  in  (6)  is  sufficient  for  inferring  the  underlying  relationships  between  the  different 
data  {xi}i=itN,  and  processing  these  data  collaboratively.  However,  the  beta-Bernoulli  construction  does 
not  explicitly  segment  the  image,  and  therefore  an  advantage  of  the  PSBP-BPFA  construction  is  that  it 
yields  comparable  denoising  and  interpolation  performance  as  (6),  while  also  simultaneously  yielding  an 
effective  image  segmentation. 

For  the  CS  problem,  we  measure  yi  =  Saq,  and  therefore  each  of  the  n  measurements  associated  with 
each  image  patch  (51  e  Rnx  p)  loses  the  original  pixels  in  xL  (the  projection  matrix  X  may  also  change 
with  each  patch,  denoted  £,;).  Therefore,  for  CS  one  cannot  consider  all  possible  shifts  of  the  patches,  as 
the  patches  arc  predefined  and  fixed  in  the  CS  measurement  (in  the  denoising  and  interpolation  problems 
the  patches  arc  defined  in  the  subsequent  analysis).  Therefore,  for  CS  imposition  of  the  clustering  behavior 
via  DP  or  PSBP  provides  important  information,  yielding  state-of-the-art  CS-recovery  results. 

Before  proceeding  to  the  results,  we  also  reconsider  the  theorem  in  Section  II-B.  It  proved  convenient, 
such  that  existing  matrix-completion  theory  could  be  readily  applied,  to  assume  that  the  latent  binary 
vectors  {z,:}j= qjv  clustered.  While  that  analysis  demonstrated  the  promise  of  recovering  missing  pixels  in 
natural  images,  our  empirical  results  suggest  that  a  more-thorough  theoretical  analysis  of  this  problem  is 
needed.  Assume  we  measure  x\  =  P^.  (xt),  where  (pt  denotes  the  set  of  pixels  observed  for  x,  (different 
subsets  of  pixels  are  observed  for  different  patches,  and  by  processing  all  pixels  “collaboratively,”  we 


may  infer  the  underlying  dictionary  D  and  the  sparse  w,).  The  logarithm  of  the  posterior  of  (all)  model 
parameters  ©,  given  observed  data  V  =  {;r;'},;=l./v  and  model  hyperparameters  H,  may  be  expressed  as 


-log  p(@\V,H)  =  '^-^j\\P^{.Xi-D{siozi))\\l  + -^2\\dk\\l  +  ^-^\\s, 


—  log  f  {{zij^L^TL)  —  log  Gamma  (7,=  |  H)  —  logGamma(7s|?f)  +  Const. 


where  density  function  /{{z^fL^Tt)  represents  the  particular  prior  placed  on  {zj}j= qjv,  and  we  have 
considered  the  beta-Bernoulli  prior,  as  well  as  the  DP  and  PSBP  constructions.  Note  that  the  Gaussian 
assumption  on  e,  yields  a  Frobenius  norm  between  the  observed  image  data  and  that  manifested  by  the 
model  (constituted  via  YliLi  |  P4, ,  ( x *  —  D  ( s ,  °  2 1 )  j  1 1  ) .  Therefore,  this  construction  is  closely  linked  to  the 
optimization  approach  advocated  for  near-low -rank  matrix  completion  [7],  which  also  uses  a  Frobenius 
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norm.  The  key  distinction,  however,  is  manifested  here  via  which  moves  beyond  linear 

(low-rank)  models  to  nonlinear  constructions  (the  subspace  used  to  represent  missing  data  is  patch- 
dependent). 


V.  Example  Results 

A.  Reproducible  research 

The  test  results  and  the  Matlab  code  to  reproduce  them  can  be  downloaded  from  http  :  / / www .ee.duke.edu/  ~ 
mzl / Results  /  B  P  F  Almage / . 

B.  Parameter  settings 

For  all  BPFA,  DP-BPFA  and  PSBP-BPFA  computations,  the  dictionary  truncation  level  was  set  at 
K  =  256  or  K  =  512  based  on  the  size  of  the  image.  Not  all  K  dictionary  elements  are  used  in 
the  model;  the  truncated  beta-Bernoulli  process  infers  the  subset  of  dictionary  elements  employed  to 
represent  the  data  {ccj}j=ipv-  The  number  of  DP  and  PSBP  sticks  was  set  at  Nl  =  20.  The  library 
of  PSBP  parameters  is  defined  as  in  [34],  The  hyperparameters  within  the  gamma  distributions  were 
set  as  c  =  d=  e  =  f  =  10-6,  as  is  typically  done  in  models  of  this  type  [39]  (the  same  settings 
were  used  for  the  gamma  prior  for  the  DP  precision  parameter  a).  The  beta-distribution  parameters  are 
set  as  a  =  K  and  b  =  1  if  random  initialization  is  used  or  a  =  K  and  b  =  N/H  if  a  singular  value 
decomposition  (SVD)  based  initialization  is  used.  None  of  these  parameters  have  been  optimized  or 
tuned.  When  performing  inference,  all  parameters  are  initialized  randomly  (as  a  draw  from  the  associated 
prior)  or  based  on  the  SVD  of  the  image  under  test.  The  Gibbs  samplers  for  the  BPFA,  DP-BPFA  and 
PSBP-BPFA  have  been  found  to  mix  and  converge  quickly,  producing  satisfactory  results  with  as  few 
as  20  iterations.  The  inferred  images  represent  the  average  from  the  collection  samples.  All  software 
was  written  in  non-optimized  Matlab.  On  a  Dell  Precision  T3500  computer  with  a  2.4  GHz  CPU,  for 
N  =  148,  836  patches  of  size  8x8x3  with  20%  of  the  RGB  pixels  observed  at  random,  the  BPFA 
required  about  2  minutes  per  Gibbs  iteration  (the  DP  version  was  comparable),  and  PSBP-BPFA  required 
about  3  minutes  per  iteration.  For  the  106-band  hyperspectral  imagery,  which  employed  N  =  428, 578 
patches  of  size  4  x  4  x  106  with  2%  of  the  voxels  observed  uniformly  at  random,  each  Gibbs  iteration 
required  about  15  minutes. 

C.  Denoising 

The  BPFA  denoising  algorithm  is  compared  with  the  original  KSVD  [13],  for  both  grey-scale  and  color 
images.  Newer  denoising  algorithms  include  block  matching  with  3D  filtering  (BM3D)  [9],  the  multiscale 
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KSVD  [29],  and  KSVD  with  the  non-local  mean  constraints  [26].  These  algorithms  assume  the  noise 
variance  is  known,  while  the  proposed  model  automatically  infers  the  noise  variance  from  the  image  under 
test.  Moreover,  the  BPFA,  DP-BPFA  and  PSBP-BPFA  models  infer  a  potentially  non-stationary  variance, 
with  a  broad  prior  on  the  variance  imposed  by  the  gamma  distribution.  In  the  denoising  examples  we 
consider  the  BPFA  model  in  (6);  similar  results  are  obtained  via  the  DP-BPFA  and  PSBP-BPFA  models 
discussed  in  Section  IV. 

In  Table  I  we  consider  images  from  [13].  The  proposed  BPFA  performs  very  similarly  to  KSVD.  As 
one  representative  example  of  the  model’s  ability  to  infer  the  noise  variance,  we  consider  the  Lena  image 
from  Table  I.  The  mean  inferred  noise  standard  deviations  are  5.83,  10.59,  15.53,  20.48,  25.44,  50.46 
and  100.54  for  images  contaminated  by  noise  with  respective  standard  deviations  of  5,  10,  15,  20,  25,  50 
and  100.  Each  of  these  noise  variances  were  automatically  inferred  using  exactly  the  same  model,  with 
no  changes  to  the  gamma  hyperparameters. 

In  Table  II  we  present  similar  results,  for  denoising  RGB  images;  the  KSVD  comparisons  come  from 
[28].  An  example  denoising  result  is  shown  in  Figure  1.  As  another  example  of  the  BPFA’s  ability  to 
infer  the  underlying  noise  variance,  for  the  castle  image,  the  mean  (automatically)  inferred  variances  are 
5.15,  10.18,  15.22  and  25.23  for  images  with  additive  noise  with  true  respective  standard  deviations  5, 
10,  15  and  25.  The  sensitivity  of  the  KSVD  algorithm  to  a  mismatch  between  the  assumed  and  true  noise 
variances  is  shown  in  Figure  1  in  [42],  and  the  insensitivity  of  BPFA  to  changes  in  the  noise  variance 
and  to  requiring  knowledge  of  the  noise  variance  is  deemed  an  important  advantage. 


Fig.  1.  From  left  to  right:  the  original  horses  image,  the  noisy  horses  image  with  the  noise  standard  deviation  of 
25,  the  denoised  image  and  the  inferred  dictionary  with  its  elements  ordered  in  the  probability  to  be  used  (from 
top-left).  The  low-probability  dictionary  elements  are  never  used  to  represent  {•'k,  }v=1iy,  and  are  draws  from  the 
prior,  showing  the  ability  of  the  model  to  learn  the  number  of  dictionary  elements  needed  for  the  data. 


It  is  also  important  to  note  that  the  grey-scale  KSVD  results  in  Table  I  were  initialized  using  an  over¬ 
complete  DCT  dictionary,  while  the  RGB  KSVD  results  in  Table  II  employed  an  extensive  set  of  training 
imagery  to  learn  a  dictionary  D  that  was  used  to  initialize  the  denoising  computations.  All  BPFA,  DP- 
BPFA  and  PSBP-BPFA  results  employ  no  training  data,  with  the  dictionary  initialized  at  random  using 


April  17,  2010 


DRAFT 


15 


draws  from  the  prior  or  with  the  SVD  of  the  data  under  test. 

TABLE  I 

Grey-scale  image  denoising  PSNR  results,  comparing  KSVD  [13]  and  BPFA,  using  patch  size 
8x8.  The  top  and  bottom  parts  of  each  cell  are  results  of  KSVD  and  BPFA,  respectively. 


a 

C.man 

House 

Peppers 

Lena 

Barbara 

Boats 

F.print 

Couple 

Hill 

5 

37.87 

39.37 

37.78 

38.60 

38.08 

37.22 

36.65 

37.31 

37.02 

37.32 

39.18 

37.24 

38.20 

37.94 

36.43 

36.29 

36.77 

36.24 

10 

33.73 

35.98 

34.28 

35.47 

34.42 

33.64 

32.39 

33.52 

33.37 

33.40 

36.29 

34.31 

35.62 

34.63 

33.70 

32.42 

33.63 

33.31 

15 

31.42 

34.32 

32.22 

33.70 

32.37 

31.73 

30.06 

31.45 

31.47 

31.34 

34.52 

32.46 

33.93 

32.61 

31.97 

30.23 

31.73 

31.64 

20 

29.91 

33.20 

30.82 

32.38 

30.83 

30.36 

28.47 

30.00 

30.18 

30.03 

33.25 

31.10 

32.65 

31.10 

30.70 

28.72 

30.34 

30.47 

25 

28.85 

32.15 

29.73 

31.32 

29.60 

29.28 

27.26 

28.90 

29.18 

28.99 

32.24 

30.00 

31.63 

29.88 

29.70 

27.58 

29.28 

29.57 

50 

25.73 

27.95 

26.13 

27.79 

25.47 

25.95 

23.24 

25.32 

26.27 

25.67 

28.49 

26.46 

28.29 

26.03 

26.50 

24.14 

25.94 

26.81 

100 

21.69 

23.71 

21.75 

24.46 

21.89 

22.81 

18.30 

22.60 

23.98 

21.93 

24.37 

22.73 

24.95 

22.13 

23.32 

20.44 

23.01 

24.22 

D.  Image  interpolation 

For  the  initial  interpolation  examples,  we  consider  standard  RGB  images,  with  80%  of  the  RGB  pixels 
missing  uniformly  at  random  (the  data  under  test  arc  shown  in  Figure  2).  Results  arc  first  presented  for 
the  Castle  and  Mushroom  images,  with  comparisons  between  the  BPFA  model  in  (6)  and  the  PSBP- 
BPFA  model  discussed  in  Section  IV.  The  difference  between  the  two  is  that  the  former  is  a  “bag-of- 
patches”  model,  while  the  latter  accounts  for  the  spatial  locations  of  the  patches.  Further,  the  PSBP-BPFA 
simultaneously  performs  image  recovery  and  segmentation.  The  results  arc  shown  in  Figure  3,  presenting 
the  mean  reconstructed  images  and  inferred  segmentations.  Each  color  in  the  inferred  segmentation 
represents  one  PSBP  mixture  component,  and  the  figure  shows  the  last  Gibbs  iteration  (to  avoid  issues 
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TABLE  II 

RGB  IMAGE  DENOISING  PSNR  RESULTS  COMPARING  KSVD  [28]  AND  BPFA,  BOTH  USING  A  PATCH  SIZE  OF 
7x7.  The  top  and  bottom  parts  of  each  cell  show  the  results  of  KSVD  and  BPFA,  respectively. 


a 

Castle 

Mushroom 

Train 

Horses 

Kangroo 

5 

40.37 

39.93 

39.76 

40.09 

39.00 

40.34 

39.73 

39.38 

39.96 

39.00 

10 

36.24 

35.60 

34.72 

35.43 

34.06 

36.28 

35.70 

34.48 

35.48 

34.21 

15 

33.98 

33.18 

31.70 

32.76 

31.30 

34.04 

33.41 

31.63 

32.98 

31.68 

25 

31.19 

30.26 

28.16 

29.81 

28.39 

31.24 

30.62 

28.28 

30.11 

28.86 

with  label  switching  between  Gibbs  iterations).  While  the  BPFA  does  not  directly  yield  a  segmentation, 
its  PSNR  results  are  comparable  to  those  inferred  by  PSBP-BPFA,  as  summarized  in  Table  III. 

TABLE  III 

Comparison  of  interpolation  of  the  Castle  and  Mushroom  images,  based  upon  observing  20% 
OF  THE  PIXELS,  SELECTED  UNIFORMLY  AT  RANDOM.  RESULTS  ARE  SHOWN  USING  BPFA  AND  PSBP-BPFA, 
AND  THE  ANALYSIS  IS  SEPARATELY  PERFORMED  USING  8x8x3  AND  5x5x3  IMAGE  PATCHES. 


Castle  8x8x3 

Castle  5x5x3 

Mushroom  8x8x3 

Mushroom  5x5x3 

BPFA 

29.32 

28.48 

31.63 

31.17 

PSBP-BPFA 

29.54 

28.46 

32.03 

31.27 

An  important  additional  advantage  of  Bayesian  models  like  BPFA,  DP-BPFA  and  PSBP-BPFA  is  that 
they  provide  a  measure  of  confidence  in  the  accuracy  of  the  inferred  image.  In  Figure  4  we  plot  the 
variance  of  the  inferred  error  ?y,  computed  via  the  Gibbs  collection  samples. 

To  provide  a  more-thorough  examination  of  model  performance,  in  Table  IV  we  present  results  for 
several  well-studied  grey-scale  and  RGB  images,  as  a  function  of  the  fraction  of  pixels  missing.  All  of 
these  results  are  based  upon  BPFA,  with  DP-BPFA  and  PSBP-BPFA  yielding  similar  results.  Finally,  in 
Table  V  we  perform  interpolation  and  denoising  simultaneously,  again  with  no  training  data  and  without 
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Fig.  2.  Images  with  80%  of  the  RGB  pixels  missing  at  random.  Left:  castle  image,  right:  mushroom  image. 


Fig.  3.  PSBP-BPFA  analysis  with  80%  of  the  RGB  pixels  missing  uniformly  at  random  (see  Figure  2).  The  analysis 
is  based  on  8  x  8  x  3  image  patches,  considering  all  possible  (overlapping)  parches.  For  a  given  pixel,  the  results 
are  the  average  based  upon  all  patches  in  which  it  is  contained.  For  each  example,  recovered  image  based  on  an 
average  of  Gibbs  collection  samples  (left),  and  each  color  representing  one  of  the  PSBP  mixture  components  (right). 


prior  knowledge  of  the  noise  level.  An  example  result  is  shown  in  Figure  5.  To  our  knowledge,  this  is 
the  first  time  such  a  result  has  been  presented. 

For  all  of  the  examples  considered  above,  for  both  grey-scale  and  RGB  images,  we  also  attempted 
a  direct  application  of  matrix  completion  based  on  the  incomplete  matrix  X  sE  RPxN,  with  columns 
defined  by  the  image  patches.  We  specifically  considered  the  algorithm  in  [19],  using  software  from 
Prof.  Candes’  website.  For  most  of  the  examples  considered  above,  even  after  very  careful  tuning  of  the 
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Fig.  4.  Expected  variance  of  each  pixel  for  the  (Mushroom)  data  considered  in  Figure  3. 


Fig.  5.  From  left  to  right:  the  original  barbara256  image,  the  noisy  and  incomplete  barbara256  image  with  the  noise 
standard  deviation  of  15  and  70%  of  its  pixels  missing  at  random,  the  restored  image  and  the  inferred  dictionary 
with  its  elements  ordered  in  the  probability  to  be  used  (from  top-left). 


parameters,  the  algorithm  diverged,  suggesting  that  the  low-rank  assumptions  were  violated.  For  examples 
for  which  the  algorithm  did  work,  the  PSNR  values  were  typically  4  to  5  dB  worse  than  those  reported 
here  for  our  model. 

E.  Interpolation  of  hyperspectral  imagery 

The  basic  BPFA,  DP-BPFA  and  PSBP-BPFA  technology  may  also  be  applied  to  hyperspectral  imagery, 
and  it  is  here  where  these  methods  may  have  significant  practical  utility.  Specifically,  the  amount  of  data 
that  need  be  measured  and  read  off  a  hyperspectral  camera  is  often  enormous.  By  selecting  a  small 
fraction  of  voxels  for  measurement  and  read-out,  selected  uniformly  at  random,  the  quantity  of  data  that 
need  be  handled  is  reduced  substantially.  Further,  one  may  simply  modify  existing  hyperspectral  cameras. 
We  consider  hyperspectral  data  with  106  spectral  bands,  measured  by  the  US  National  Geospatial  Agency 
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TABLE  IV 

Top:  BPFA  gray-scale  image  interpolation  PSNR  results,  using  patch  size  8x8.  Bottom:  BPFA 
RGB  IMAGE  INTERPOLATION  PSNR  RESULTS,  USING  PATCH  SIZE  7x7. 


data  ratio 

C.man 

House 

Peppers 

Lena 

Barbara 

Boats 

F.print 

Man 

Couple 

Hill 

20% 

24.11 

30.12 

25.92 

31.00 

24.80 

27.81 

26.03 

28.24 

27.72 

29.33 

30% 

25.71 

33.14 

28.19 

33.31 

27.52 

30.00 

09.01 

30.06 

30.00 

31.21 

50% 

28.90 

38.02 

32.58 

36.94 

33.17 

33.78 

33.53 

33.29 

35.56 

34.23 

80% 

34.70 

43.03 

37.73 

41.27 

40.76 

39.50 

40.17 

39.11 

38.71 

38.75 

data  ratio 

Castle 

Mushroom 

Train 

Horses 

Kangroo 

20% 

29.12 

31.56 

24.59 

29.99 

29.59 

30% 

32.02 

34.63 

27.00 

32.52 

32.21 

50% 

36.45 

38.88 

32.00 

37.27 

37.34 

80% 

41.51 

42.56 

40.73 

41.97 

42.74 

TABLE  V 

Simultaneous  image  denoising  and  interpolation  PSNR  results  for  BPFA,  considering  the 

Barbara256  image  and  using  patch  size  8x8. 


<7 

10% 

20% 

30% 

50% 

100% 

0 

23.47 

26.87 

29.83 

35.60 

42.94 

5 

23.34 

26.73 

29.27 

33.61 

37.70 

10 

23.16 

26.07 

28.17 

31.17 

34.31 

15 

22.66 

25.17 

26.82 

29.31 

32.14 

20 

22.17 

24.27 

25.62 

27.90 

30.55 

25 

21.68 

23.49 

24.72 

26.79 

29.30 

(NGA).  Because  of  the  significant  statistical  correlation  across  the  multiple  spectral  bands,  the  fraction 
of  data  that  need  be  read  is  further  reduced,  relative  to  grey-scale  or  RGB  imagery.  In  this  example  we 
considered  2%  of  the  voxels,  selected  uniformly  at  random,  and  used  image  patches  of  size  4  x  4  x  106. 
Other  than  the  increased  data  dimensionality,  nothing  in  the  model  was  changed. 

In  Figure  6  we  show  example  (mean)  inferred  images,  at  two  (arbitrarily  selected)  spectral  bands,  as 
computed  via  BPFA.  All  106  spectral  bands  are  analyzed  simultaneously.  The  average  PSNR  for  the  data 
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cube  (size  845  x  512  x  106)  is  30.96  dB.  While  the  PSNR  value  is  of  interest,  for  data  of  this  type  the 
more  important  question  concerns  the  ability  to  classify  different  materials  based  upon  the  hyperspectral 
data.  In  a  separate  forthcoming  paper  we  consider  classification  based  on  the  full  datacube,  and  based 
upon  the  BPFA- inferred  datacube  using  2%  of  the  voxels,  with  encouraging  results  reported.  We  also  tried 
the  low -rank  matrix  completion  algorithm  from  [19]  for  the  hyperspectral  data,  and  even  after  extensive 
parameter  tuning,  the  algorithm  diverged  for  all  hyperspectral  data  considered. 

In  Table  VI  we  summarize  algorithm  performance  on  another  hyperspectral  data  set,  composed  of  210 
spectral  bands.  We  show  the  PSNR  values  as  a  function  of  percentage  of  observed  data,  and  as  a  function 
of  the  size  of  the  image  patch.  Note  that  the  lxl  patches  only  exploit  spectral  information,  while  the 
other  patch  sizes  exploit  both  spatial  and  spectral  information. 

TABLE  VI 

BPFA  HYPERSPECTRAL  IMAGE  INTERPOLATION  PSNR  RESULTS.  FOR  THIS  EXAMPLE  THE  TEST  IMAGE  IS  A 
150  X  150  URBAN  IMAGE  WITH  210  SPECTRAL  BANDS.  RESULTS  ARE  SHOWN  AS  A  FUNCTION  OF  THE 
PERCENTAGE  OF  OBSERVED  VOXELS,  FOR  DIFFERENT  SIZED  PATCHES  ( e.g .,  THE  4x4  CASE  CORRESPONDS  TO 

4  X  4  X  210  “PATCHES”). 


Observed  data  (%) 

1  x  1 

2x2 

3x3 

4x4 

2 

15.34 

21.09 

22.72 

23.46 

5 

17.98 

23.58 

25.30 

25.88 

10 

20.41 

25.27 

26.36 

26.68 

20 

22.22 

26.50 

27.02 

27.16 

F.  Compressive  sensing 

We  consider  a  CS  example  in  which  the  image  is  divided  into  8x8  patches,  with  these  constituting  the 
underlying  data  {xi}i=i  ^  to  be  inferred.  For  each  of  the  N  blocks,  a  vector  of  CS  measurements  yi  = 
Xatj  is  measured,  where  the  number  of  projections  per  patch  is  n,  and  the  total  number  of  CS  projections 
is  n  ■  N .  In  our  first  examples  the  elements  of  T,  are  constructed  randomly,  as  draws  from  Af(0, 1);  many 
other  random  projection  classes  may  be  considered  [2]  (and  below  we  also  consider  optimized  projections 
X,  matched  to  the  dictionary  D).  Each  aq  is  assumed  represented  in  terms  of  a  dictionary  x,  =  Dtry  +  e,., 
and  three  constructions  for  D  were  considered:  (i)  a  DCT  expansion;  (ii)  learning  of  D  using  BPFA, 
using  training  images;  (in)  using  the  BPFA  to  perform  joint  CS  inversion  and  learning  of  D.  For  (ii). 


April  17,  2010 


DRAFT 


21 


the  training  data  consisted  of  4000  8  x  8  patches  chosen  at  random  from  100  images  selected  from  the 
Microsoft  database  ( http  :  / /research.microso ft.com/ en  —  us/ projects /objectclassrecognitiori). 
The  dictionary  was  set  to  K  =  256,  and  the  offline  beta  process  inferred  a  dictionary  of  size  M  =  237. 

Representative  CS  reconstruction  results  are  shown  in  Figure  7  (left)  based  upon  a  DCT  dictionary,  for  a 
grey-scale  version  of  the  “castle”  image.  The  results  in  Figure  7  (right)  are  based  on  a  learned  dictionary; 
except  for  the  “online  BP”  results  (where  D  and  {wi}i=\tN  are  learned  jointly),  all  of  these  results 
employ  the  same  dictionary  D  learned  off-line  as  mentioned  above,  and  the  algorithms  arc  distinguished 
by  different  ways  of  estimating  {wi}i=  A  range  of  CS-inversion  algorithms  arc  considered  from  the 
literature,  and  several  BPFA-based  constructions  arc  considered  as  well  for  CS  inversion.  The  online 
BPFA  results  (with  no  training  data)  are  quite  competitive  with  those  based  on  a  dictionary  learned 
off-line. 

Note  that  results  based  on  a  learned  dictionary  arc  markedly  better  than  those  based  on  the  DCT;  similar 
results  were  achieved  when  the  DCT  was  replaced  by  a  wavelet  representation.  For  the  DCT-based  results, 
note  that  the  DP-BPFA  and  PSBP-BPFA  CS  inversion  results  are  significantly  better  than  those  of  all 
other  CS  inversion  algorithms.  The  results  reported  here  arc  consistent  with  tests  we  performed  using 
over  100  images  from  the  aforementioned  Microsoft  database,  not  reported  here  in  detail  for  brevity. 

In  all  previous  results  the  projection  matrix  X  was  constituted  randomly.  We  now  consider  a  simple 
means  of  matching  X  to  a  D  learned  offline,  based  upon  representative  training  images.  Assume  a 
learned  D  £  RPx  K ,  with  K  >  P ,  which  may  be  represented  via  SVD  as  D  =  UAV7  ;  U  £  RPx  p 
and  V  £  RKxP  are  each  composed  of  orthonormal  columns,  and  A  is  a  P  x  P  diagonal  matrix.  The 
columns  of  U  span  the  linear  subspace  of  Rp  in  which  the  columns  of  D  reside.  Further,  since  the 
columns  of  D  are  generally  not  orthonormal,  each  column  of  D  is  “spread  out”  when  expanded  in  the 
columns  of  U.  Therefore,  one  expects  that  U  and  D  are  incoherent.  Hence,  a  simple  means  of  matching 
CS  projections  to  the  data  is  to  define  the  rows  of  X  in  terms  of  randomly  selected  columns  of  U.  This 
was  done  in  Figure  8  for  the  grey-scale  “castle”  image,  using  the  same  learned  dictionary  as  considered 
in  Figure  7.  It  is  observed  that  this  procedure  yields  a  marked  improvement  in  CS  recovery  accuracy,  for 
all  CS  inversion  algorithms  considered. 

Concerning  computational  costs,  all  CS  inversions  were  run  efficiently  on  PCs,  with  the  specifics 
computational  times  dictated  by  the  detailed  Matlab  implementation  and  the  machine  run  on.  A  rough 
ranking  of  the  computational  speeds,  from  fastest  to  slowest,  is  as  follows:  StOMP-CFAR,  Fast  BCS, 
OMP,  BPFA,  LARS/Lasso,  Online  BPFA,  DP-BPFA,  PSBP-BPFA,  VB  BCS,  Basis  Pursuit;  in  this  list, 
algorithms  BPFA  through  Basis  Pursuits  have  approximately  the  same  computational  costs. 
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VI.  Conclusions 

The  truncated  beta-Bernoulli  process  has  been  employed  to  learn  dictionaries  matched  to  image 
patches  {aq}j— The  basic  nonparametric  Bayesian  model  is  termed  a  beta  process  factor  analysis 
(BPFA)  framework,  and  extensions  have  also  been  considered.  Specifically,  the  Dirichlet  process  (DP)  has 
been  employed  to  cluster  the  {xi}i=\^,  encouraging  similar  dictionary-element  usage  within  respective 
clusters.  Further,  the  probit  stick-breaking  process  (PSBP)  has  been  used  to  impose  that  proximate 
patches  are  more  likely  to  be  clustered  similarly  (imposing  that  they  are  more  probable  to  employ  similar 
dictionary  elements).  All  inference  has  been  performed  by  a  Gibbs  sampler,  with  analytic  update  equations. 
The  PBFA,  DP-BPFA  and  PSBP-BPFA  have  been  applied  to  three  problems  in  image  processing:  (i) 
denoising,  ( ii )  image  interpolation  based  upon  a  subset  of  pixels  selected  uniformly  at  random,  and 
(Hi)  learning  dictionaries  for  compressive  sensing  and  also  compressive  sensing  inversion.  We  have  also 
considered  jointly  performing  (i)  and  (ii).  Important  advantages  of  the  proposed  methods  arc:  (i)  a  full 
posterior  on  model  parameters  are  inferred,  and  therefore  “error  bars”  may  be  placed  on  the  inverted 
images;  (ii)  the  noise  variance  need  not  be  known,  and  is  inferred  within  the  analysis  and  may  be 
nonstationary;  (Hi)  while  training  data  may  be  used  to  initialize  the  dictionary  learning,  this  is  not 
needed,  and  the  BPFA  results  are  highly  competitive  even  based  upon  random  initializations.  In  the 
context  of  compressive  sensing,  the  DP-BPFA  and  PSBP-BPFA  results  arc  state  of  the  art,  significantly 
better  than  existing  published  methods.  Finally,  based  upon  the  learned  dictionary,  a  simple  method  has 
been  constituted  for  optimizing  the  CS  projections. 

The  interpolation  problem  is  related  to  CS,  in  that  we  exploit  the  fact  that  {xi}i= reside  on  a  low¬ 
dimensional  nonlinear  subspace  of  Mp,  such  that  the  total  number  of  measurements  is  small  relative  to 
N  ■  P  (recall  xr  £  Rp).  However,  in  CS  one  employs  projection  measurements  £.%•.,,  where  £  £  MnxP, 
ideally  with  n  <C  P.  The  interpolation  problem  corresponds  to  the  special  case  in  which  the  rows  of  £ 
arc  randomly  selected  rows  of  the  P  x  P  identity  matrix.  This  problem  is  closely  related  to  the  problem 
of  matrix  completion  [6],  [22],  [35],  where  the  incomplete  matrix  X  £  RPxiV  has  columns  defined  by 
{xi}i=itN.  However,  the  {.x,}J=iv  reside  in  a  nonlinear  subspace  of  Mp,  while  low-rank-based  methods 
assume  that  the  data  reside  in  a  low-dimensional  linear  subspace.  We  showed  that  if  the  {ajj}j=ipv  may 
be  clustered,  manifesting  a  union-of-subspace  model,  then  matrix  completion  theory  may  be  employed 
relatively  simply  to  place  bounds  on  anticipated  accuracy  of  recovered  missing  data. 

While  the  PSBP-BPFA  successfully  segmented  the  image  while  recovering  missing  data,  we  found 
that  the  PSNR  performance  of  direct  BPFA  analysis  performed  very  close  to  that  of  PSBP-BPFA.  This 
suggests  that  the  properties  of  the  beta-Bernoulli  process  (recall  the  Indian  buffet  process  discussion 
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in  Section  III-A)  naturally  manifests  an  effective  clustering  of  these  data.  An  important  area  of  future 
research  is  to  extend  the  matrix-completion  theory  to  the  case  for  which  the  columns  of  X  come  from 
a  general  non-linear  subspace  of  Mp,  such  as  that  impose  via  the  beta-Bernoulli  prior. 

Appendix:  Gibbs  Sampling  Inference 

The  Gibbs  sampling  update  equations  arc  given  below;  we  provide  the  update  equations  for  the  BPFA. 
and  the  DP  and  PSBP  versions  arc  relatively  simple  extensions.  Below,  X,  represents  the  projection 
matrix  on  the  data,  for  image  patch  x j.  For  the  CS  problem,  Xj  is  typically  fully  populated,  while  for 
the  interpolation  problem  each  row  of  X,  is  all  zeros  except  for  a  single  one,  corresponding  to  the 
specific  pixel  that  is  measured.  The  update  equations  are  the  conditional  probability  of  each  parameter, 
conditioned  on  all  other  parameters  in  the  model. 

Sample  dk 

TV 

p(dk\—)  OC  JJ N{Vi\ X;D(s;  ©  Zi),  7e_1I|)Si ||0 )AA(dfc;  0,  P_1Ip) 

1=1 

It  can  be  shown  that  dk  can  be  drawn  from  a  normal  distribution 

P{dk |-)  ~  A/'(/rdfe,XdJ 

with  the  covariance  Xd)  and  mean  fidi  expressed  as 

/  TV 

Xdfc=  Ft  +  7eE44xfXi 

V  i=i 

TV 

P'dk  7 e^dk  ^  ]  ziksikxi 

i=l 

where 

=  Xft/j  -  X/  X,I);.s,  ©  +  XfXidfc(sifc  ©  zik). 

Sample  zk:  =  [zlk,  z2k,---  ,  zNk\ 

p(zik |-)  OC  A f{yt;  X.tD(sj  ©  Zj),7£_1I||s,||0)Bemoulli(^jk;  nk) 

The  posterior  probability  that  zlk  =  1  is  proportional  to 

pi  =  irk  exp  “(sffcdfcXfXidfc  -  2 sikdlxjk) 
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and  the  posterior  probability  that  Zik  =  0  is  proportional  to 

Po  =  1  - 

so  Zjj.  can  be  drawn  from  a  Bernoulli  distribution  as 

~  Bernoulli! — — — ). 

Po  +  Pi 


Sample  sk:  =  [sr.,  s2k,  •  •  •  ,  SNk] 


P{sik\~ )  ocA/'(yi;SiD(si©zi),7e  1I||si||o)A/'(si;0,7s  4^) 


It  can  be  shown  that  .7/,.  can  be  drawn  from  a  normal  distribution 

P{sik\~)  ~  Af{nSik,T,Sik) 

with  the  variance  and  mean  fiSik  expressed  as 

^ Sik  =  (7s  +  7 ezikdk  i  Sidfc) 

Psik  =  'Je^SikZikd'k^i  • 

Note  7/.  is  equal  to  either  1  or  0,  Ss.fc  and  can  be  further  expressed  as 


(Ts  ”1“  7e^fc  ^7  if  7fc  —  1 

if  %ik  ii 


^ Sik  — 


Psik  — 


7,  1 


>:,  .  dk  £?  £,-x;  if  Zik  =  1 


0 


if  —  0 


Sample  717 


N 

p{ 7Tfc|— )  oc  Beta(7Tfc;a,6)  PjBemoulli(2:ife;7rfc) 
It  can  be  shown  that  77,  can  be  drawn  from  a  Beta  distribution  as 

iV  ,  -x  iV 


p(TTfcl-)  ~  Beta(-^r  +  ^2  zik,  b°^Kr  ^  +  N  -  ^7fc) 


2=1 


2=1 


Sample  7 s 


N 

p(7s|-)  oc  r(7s;c0,d0)  JjAf(s7  0,  771jA') 
2=1 


(14) 


(15) 
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It  can  be  shown  that  7.5  can  be  drawn  from  a  Gamma  distribution  as 


Sample  7e 


N 


P( Te|—)  oc  r(7e;e0,  /o)  ^‘D(Si  0  **).7e“ll||Si 


(16) 


i=  1 


(17) 


It  can  be  shown  that  y£  can  be  drawn  from  a  Gamma  distribution  as 

/  1  N  1  N  \ 

P(  7e|-)  ~  r  e0  +  llS*llo,/o+  g  2  ©Zj)k  • 

V  Z  i=l  z  i=l  / 

Note  that  X,  X,;  is  a  sparse  identity  matrix,  X^  is  a  diagonal  matrix,  and  Z  is  a  sparse  matrix,  it  is 
easy  to  find  that  only  basic  arithmetical  operations  are  needed  and  many  unnecessary  calculations  can 
be  avoided,  leading  to  fast  computation  and  low  memory  requirement. 
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Fig.  6.  Comparison  of  recovered  band  (average  from  Gibbs  collection  iterations)  for  hyperspectral  imagery  with 
106  spectral  bands.  The  interpolation  is  performed  using  2%  of  the  hyperspectral  datacube,  selected  uniformly  at 
random.  The  analysis  employs  4  x  4  x  106  patches.  All  spectral  bands  are  analyzed  at  once,  and  here  the  data 
(recovered  and  original)  are  shown  (arbitrarily)  for  bands  1  (top)  and  50  (bottom).  Results  are  computed  using  the 
BPFA  model. 
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Fig.  7.  Left:  Compressive  sensing  (CS)  results  on  grey-scale  Castle  image,  based  on  a  DCT  dictionary  D.  The  CS 
projection  matrix  £  is  constituted  randomly,  with  elements  drawn  iid  from  W(0,1).  Results  are  shown  using  the 
DP-BPFA  and  PSBP-BPFA  models  in  Section  IV.  Comparisons  are  also  made  with  several  CS  inversion  algorithms 
from  the  literature.  Right:  Same  as  on  the  left  but  based  on  a  learned  dictionary  D  instead  of  DCT.  The  online  BP 
results  employ  BPFA  to  learn  D  and  do  CS  inversion  jointly.  All  other  results  are  based  upon  a  learned  D  with 
learning  performed  offline  using  distinct  training  images. 


Fig.  8.  Compressive  sensing  (CS)  results  on  grey-scale  Castle  image,  based  on  a  learned  dictionary  D  (learning 
performed  offline,  using  distinct  training  data).  The  projection  matrix  £  is  matched  to  D,  based  upon  an  SVD  of 

D. 
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